where $p(j(l) \to i(l), q_i(l))$ is the transition probability between the aligned nucleotides $j(l)$ and $i(l)$, given the associated quality score $q_i(l)$, e.g., $p(\mathrm{T} \to \mathrm{G}, 40)$.
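As a rough illustration, the sketch below multiplies such per-position transition probabilities along an aligned read to obtain an error rate for the whole read, which is how the DADA2 model is understood to combine them. The function name `error_rate`, the dictionary layout of `trans_prob`, and the numeric values in the toy table are assumptions made for this example; they are not values taken from DADA2.

```python
import math

def error_rate(center, read, quals, trans_prob):
    """Product over aligned positions l of p(j(l) -> i(l), q_i(l)).

    center     -- sequence j (the presumed true sequence)
    read       -- sequence i, already aligned to `center` (same length, no gaps)
    quals      -- integer quality score q_i(l) for each position of `read`
    trans_prob -- hypothetical lookup: (from_nt, to_nt, quality) -> probability
    """
    if not (len(center) == len(read) == len(quals)):
        raise ValueError("center, read and quals must have the same length")
    # Sum logs instead of multiplying to limit floating-point underflow.
    log_rate = sum(
        math.log(trans_prob[(j_nt, i_nt, q)])
        for j_nt, i_nt, q in zip(center, read, quals)
    )
    return math.exp(log_rate)

# Toy table (made-up numbers): at quality 40 each specific substitution has
# probability 1e-4, so a base is read correctly with probability 1 - 3e-4.
toy_table = {
    (a, b, 40): (1 - 3e-4) if a == b else 1e-4
    for a in "ACGT"
    for b in "ACGT"
}

# One mismatch (T -> G at the last position) and three matches.
lam = error_rate("ACGT", "ACGG", [40, 40, 40, 40], toy_table)
```

In practice the transition probabilities are estimated from the data rather than fixed by hand; the table above only serves to make the product concrete.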
The divisive partitioning algorithm begins with all amplicon reads in a single partition.
The error rate is then used to model the number of observed reads of each unique sequence
and to compute a p-value for the hypothesis that the abundance of each unique sequence is
consistent with the error model. According to the DADA2 model, for a unique sequence $i$
with abundance $a_i$ in partition $j$ containing $n_j$ reads, the abundance p-value is
given as
$$
p_A(j \to i) = \frac{1}{1 - p_{\mathrm{poisson}}(n_j \lambda_{ji}, 0)} \sum_{a = a_i}^{\infty} p_{\mathrm{poisson}}(n_j \lambda_{ji}, a) \tag{7.2}
$$
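The abundance p-value in Equation 7.2 can be evaluated numerically as sketched below, reading p_poisson(μ, a) as the Poisson probability of observing exactly a reads when μ reads are expected. The function names (`poisson_pmf`, `poisson_tail`, `abundance_p_value`) and the stopping rule for the infinite sum are choices made for this illustration; they do not come from the DADA2 source code.

```python
import math

def poisson_pmf(mu, a):
    """Poisson probability of exactly `a` events when the mean is `mu` (> 0)."""
    return math.exp(-mu + a * math.log(mu) - math.lgamma(a + 1))

def poisson_tail(mu, a_min):
    """Sum of poisson_pmf(mu, a) for a = a_min, a_min + 1, ...

    Terms are added directly (rather than computing 1 - CDF) so that very
    small tails are not lost to cancellation. The sum stops once the terms
    are past the mode and no longer change the total at double precision.
    """
    total = 0.0
    a = a_min
    while True:
        term = poisson_pmf(mu, a)
        total += term
        if a > mu and term <= total * 1e-16:
            return min(total, 1.0)
        a += 1

def abundance_p_value(a_i, n_j, lambda_ji):
    """Equation 7.2: P(X >= a_i | X >= 1) with X ~ Poisson(n_j * lambda_ji)."""
    if a_i < 1:
        raise ValueError("a_i must be at least 1 (the sequence was observed)")
    mu = n_j * lambda_ji
    if mu <= 0.0:
        raise ValueError("n_j * lambda_ji must be positive")
    # 1 - p_poisson(mu, 0) = 1 - exp(-mu), computed stably with expm1.
    normaliser = -math.expm1(-mu)
    return min(1.0, poisson_tail(mu, a_i) / normaliser)

# A sequence seen twice where one error read is expected is unremarkable,
# whereas the same sequence seen ten times is much harder to explain as error.
p_low = abundance_p_value(2, 1000, 1e-3)
p_high = abundance_p_value(10, 1000, 1e-3)
```

The division by 1 - p_poisson(n_j λ_ji, 0) conditions on the sequence having been observed at least once, which is why a sequence with a_i = 1 always receives a p-value of 1.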
These p-values of the unique sequences are used as the division criterion for iterative
partitioning. A p-value threshold is specified; if the smallest abundance p-value falls
below the threshold, a new partition is formed around that unique sequence, and other
similar unique sequences are allowed to join it. The division continues iteratively until
every unique sequence within each partition has an abundance p-value greater than the
specified threshold.
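The iterative division described above might be sketched as follows. This is a deliberately simplified toy, not the DADA2 implementation: `toy_lambda` stands in for the quality-aware error rate with a single assumed per-base substitution rate `eps`, sequences are assumed to have equal length with no indels, sequences are reassigned to the center with the largest value of center abundance times error rate, and the `threshold` default is arbitrary.

```python
import math

def poisson_pmf(mu, a):
    return math.exp(-mu + a * math.log(mu) - math.lgamma(a + 1))

def abundance_p_value(a_i, n_j, lambda_ji):
    """Equation 7.2, with the infinite sum truncated once terms are negligible."""
    mu = n_j * lambda_ji
    if mu <= 0.0:
        return 0.0  # the error model cannot produce this sequence at all
    total, a = 0.0, a_i
    while True:
        term = poisson_pmf(mu, a)
        total += term
        if a > mu and term <= total * 1e-16:
            break
        a += 1
    return min(1.0, total / -math.expm1(-mu))

def toy_lambda(center, seq, eps):
    """Assumed error model: independent substitutions at rate eps per base."""
    d = sum(x != y for x, y in zip(center, seq))
    return (eps / 3) ** d * (1 - eps) ** (len(seq) - d)

def divisive_partition(counts, threshold=1e-3, eps=0.01):
    """Return {center sequence: [member sequences]} for unique-sequence counts."""
    centers = [max(counts, key=counts.get)]  # start: one partition, top sequence
    while True:
        # Let every sequence join the center most likely to have produced it.
        partitions = {c: [] for c in centers}
        for seq in counts:
            if seq in partitions:
                partitions[seq].append(seq)
            else:
                best = max(centers,
                           key=lambda c: counts[c] * toy_lambda(c, seq, eps))
                partitions[best].append(seq)
        # Find the sequence whose abundance is least consistent with error.
        worst_seq, worst_p = None, 1.0
        for center, members in partitions.items():
            n_j = sum(counts[s] for s in members)
            for s in members:
                if s == center:
                    continue
                p = abundance_p_value(counts[s], n_j, toy_lambda(center, s, eps))
                if p < worst_p:
                    worst_seq, worst_p = s, p
        if worst_seq is None or worst_p >= threshold:
            return partitions
        centers.append(worst_seq)  # split: it becomes the center of a new partition

counts = {
    "ACGTACGTAC": 1000,  # dominant sequence
    "ACGTACGTAA": 2,     # rare one-mismatch variant, plausible as error
    "ACGTTCGTAC": 200,   # one-mismatch variant far too abundant to be error
}
partitions = divisive_partition(counts)
```

With these toy counts, the abundant variant is split off into its own partition, while the rare variant stays with the dominant sequence because two reads are well within what the assumed error rate predicts.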
The output of the divisive amplicon denoising algorithm is a collection of ASVs, which
are exact sequences with defined statistical confidence. Because ASVs are exact sequences,
generated without clustering or reference databases, they can be readily compared between
studies using the same target region. The DADA2 pipeline generates an ASV table that can be
used for downstream analysis.
7.2.2.2.2 Deblur Denoising
Deblur [9] is a denoising method that uses error profiles of amplicons sequenced on the
Illumina MiSeq and HiSeq platforms to infer error-free sequences. Unlike DADA2, Deblur
operates on each sample independently. The Deblur algorithm begins by comparing the
pairwise Hamming distances of all sequences within a sample to an upper-bound error
profile. The unique sequences are sorted by abundance in descending order, from the most
to the least abundant. For each sequence, neighboring sequences are identified based on a
Hamming distance threshold. The expected number of erroneous reads, computed using an
upper bound on the error probability, is then subtracted from the abundance of the
neighboring sequences. After subtraction, the sequences with zero abundance are considered
noise and dropped from the list of valid sequences. Deblur can thus infer the correct
sequences; however, it may fail to remove PCR chimeras produced by aborted PCR cycles.
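The subtraction step can be illustrated with the short sketch below. It is a simplified reading of the procedure described above, not the Deblur code itself: the function name `deblur_sketch`, the `error_profile` values, and the toy counts are all assumptions, sequences are assumed to have equal length, and indels are ignored.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def deblur_sketch(counts, error_profile):
    """Subtract the expected error reads of each sequence from its neighbors.

    counts        -- {unique sequence: abundance} for ONE sample
    error_profile -- error_profile[d] is an assumed upper bound on the fraction
                     of a sequence's reads that appear as a specific neighbor at
                     Hamming distance d (index 0 is unused)
    """
    remaining = {seq: float(n) for seq, n in counts.items()}
    # Process sequences from the most to the least abundant.
    order = sorted(remaining, key=remaining.get, reverse=True)
    max_d = len(error_profile) - 1
    for seq in order:
        if remaining[seq] <= 0:
            continue  # already explained away as noise
        for other in order:
            if other == seq:
                continue
            d = hamming(seq, other)
            if 1 <= d <= max_d:  # `other` is a neighbor of `seq`
                remaining[other] -= remaining[seq] * error_profile[d]
    # Sequences whose abundance dropped to zero (or below) are treated as noise.
    return {seq: round(n) for seq, n in remaining.items() if round(n) > 0}

sample = {
    "ACGTACGT": 1000,  # abundant sequence
    "ACGTACTT": 300,   # distance 1, more reads than the error bound explains
    "ACGTACGA": 30,    # distance 1, fewer reads than the error bound explains
}
denoised = deblur_sketch(sample, error_profile=[1.0, 0.06, 0.02])
```

Here the 30-read variant is removed because up to 60 erroneous reads are expected from the 1000-read sequence at distance 1, whereas the 300-read variant survives with a reduced abundance.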
7.2.2.2.3 UNOISE2 Denoising
Unlike DADA2, UNOISE [10] denoising does not use quality scores; it uses a one-pass
clustering strategy with only two parameters (α and β), both with preset values. A unique
read sequence (M) in a cluster is evaluated based on its Levenshtein distance (d) from